[% setvar title Brace-matching for Perl Regular Expressions %]
<div id="archive-notice">
    <h3>This file is part of the Perl 6 Archive</h3>
    <p>To see what is currently happening visit <a href="http://www.perl6.org/">http://www.perl6.org/</a></p>
</div>
<div class='pod'>
<a name='TITLE'></a><h1>TITLE</h1>
<p>Brace-matching for Perl Regular Expressions</p>
<a name='VERSION'></a><h1>VERSION</h1>
<pre>  Maintainer: Eric J. Roode &lt;<a href='mailto:eric@myxa.com'>eric@myxa.com</a>&gt;
  Date: 24 Aug 2000
  Last Modified: 25 Aug 2000
  Mailing List: <a href='mailto:perl6-language-regex@perl.org'>perl6-language-regex@perl.org</a>
  Number: 145
  Version: 2
  Status: Developing</pre>
<a name='ABSTRACT'></a><h1>ABSTRACT</h1>
<p>It is quite difficult to match paired characters in Perl 5 regular
expressions. A solution is proposed, using new \m (match opening grouping
character) and \M (match closing grouping character) metacharacters.
A new compiler directive, &quot;use matchpairs&quot; controls which strings
are considered grouping characters and what their complement is.</p>
<a name='DESCRIPTION'></a><h1>DESCRIPTION</h1>
<p>A new regular expression metacharacter \m would match any of the
following characters: ([{&quot;'&lt; in a regexp. A later \M metacharacter
would match the corresponding closing pair character )]{&quot;'&gt; <i>at the
same nesting level</i> within the string being searched.</p>
<p>For example,
$string = &quot;([b - (a + 1)] * 7)&quot;;
$string =~ /\m.*?\M/;</p>
<p>The \m would match the first open parenthesis, the .*? would match the
substring &quot;[b - (a + 1)] * 7&quot;, and the \M would match the second close
parenthesis.</p>
<p>Used within a character class (square brackets in a regular expression),
\m would match <i>any</i> opening grouping character, and \M would match <i>any</i>
closing grouping character. Thus, [^\m\M]* would return a span of non-
grouping characters.</p>
<p>What exactly is matched by \m and \M is controlled by a new compiler
directive, &quot;use matchpairs&quot;. The default is:</p>
<pre>    use matchpairs ('('=&gt;')', '{'=&gt;'}'; '['=&gt;']', '&quot;'=&gt;'&quot;', 
                    &quot;'&quot;=&gt;&quot;'&quot;, '&lt;'=&gt;'&gt;');</pre>
<p>One could restrict the sets for various parsing situations:
Block:
{
use matchpairs ( '(' =&gt; ')' );
...
}</p>
<p>Or one could come up with completely new pair sets:
Block:
{
use matchpairs ( '/*' =&gt; '*/', '&quot;' =&gt; '&quot;' );
$c_code_snippet =~ /\m(.*?)\M/;  # Find string or comment
}</p>
<p>It is a compile-time error to specify an odd number of elements in
a &quot;use matchpairs&quot; directive.</p>
<p>Note: &quot;use matchpairs&quot; is a scope-limited compile-time directive.
A regular expression uses whichever matchpairs were in effect at the time
it was compiled. So:</p>
<pre>    my $pat;
    {
        use matchpairs ( '{' =&gt; '}' );
        $pat = qr/\m(.*?)\M/;
    }
    $data =~ /$pat/;    # Only matches curly braces</pre>
<p>Footnote: Someone has suggested &quot;use re 'pairs' LIST;&quot; as the directive
instead of &quot;use matchpairs LIST;&quot;. Comments?</p>
<a name='EXAMPLES'></a><h1>EXAMPLES</h1>
<p># 1. Recursive processing of nested groupings
sub parse
{
my $string = shift;
while ($string =~ /([^\m]*)(\m)(.*?)(\M)([^\m\M]*)/g)
{
my ($pre, $quote, $mid, $endquote, $post) = ($1,$2,$3,$4,$5);
process ($pre);
parse ($mid);   # Note recursion
process ($post);
}
}</p>
<p># 2. Greedy vs non-greedy matching:</p>
<pre>    $string = '(a + b) * (c + d)';
    $string =~ /\m(.*)\M/;    # Greedy. $1 is 'a + b) * (c + d'.
    $string =~ /\m(.*?)\M/;   # Non-greedy. $1 is 'a + b'.</pre>
<a name='IMPLEMENTATION'></a><h1>IMPLEMENTATION</h1>
<p>Disclaimer: I know little about Perl internals, particularly the RE engine.</p>
<p>When the RE engine encounters a \m, it should match if it finds an open
grouping character. From that point forward, it should maintain an internal
count of &quot;like&quot; open and close grouping characters, When it encounteres a
\M metacharacter, it should match if it finds a closing grouping character
of the same sort (ie, the complement to the specific string that was matched
by the \m), <i>and</i> if the nesting level of that pair is zero.</p>
<p>So, in parsing the string &quot;(abc[def](ghi)jkl)&quot; with the RE /\m(.*?)\M/:</p>
<pre>    First, \m matches &quot;(&quot;. The engine remembers that it is looking for &quot;(&quot; and
    its complement &quot;)&quot;.

    Next, as it processes .*?, scanning the string, it encounters the &quot;[&quot;
    between the &quot;c&quot; and the &quot;d&quot;. It ignores it (it has no effect on the 
    nesting-level count, since it is not &quot;(&quot; or &quot;)&quot;).

    As it continues scanning, it encounters the &quot;]&quot; between the &quot;f&quot; and the
    &quot;)&quot;. The \M does not match this &quot;]&quot; character, because the \M must match
    a &quot;)&quot;.

    Next, as it continues processing the .*?, it enounters the &quot;(&quot; between
    the &quot;]&quot; and the &quot;g&quot;. This does match the current grouping set, so the
    engine increments the nesting level to 1.

    Next, it encounters the &quot;)&quot; between the &quot;i&quot; and the &quot;j&quot;. The \M does not
    match this &quot;)&quot;, because the nesting level is not zero. Having encountered
    the &quot;)&quot;, however, the engine decrements the nesting level to 0.

    Finally, it encounters the &quot;)&quot; after the &quot;l&quot;. This one does match the \M
    because the nesting level is 0.</pre>
<p>Nested \m\M pairs present a problem in that the engine must remember which
pair of grouping characters it is looking for. Example:</p>
<p>Parse the string &quot;(abc[def](ghi)jkl)&quot; with the RE /\m(.*?)\m(.*?)\M(.*?)\M/:</p>
<pre>    \m matches &quot;(&quot;. Engine remembers that this \m corresponds to &quot;(&quot;, &quot;)&quot;.

    .*? matches &quot;abc&quot;.

    The second \m matches the &quot;[&quot;. The engine remembers that I&lt;this&gt; \m  
    corresponds to &quot;[&quot;, &quot;]&quot;.

    The second .*? matches &quot;def&quot;. 

    The \M matches &quot;]&quot;, since the second \m was for square brackets, and the
    nesting level for square brackets is zero.

    The third .*? matches &quot;(ghi)jkl&quot;. The &quot;)&quot; between the &quot;i&quot; and the &quot;j&quot; 
    does not match the \M pattern because, as in the first example, the
    nesting level is not zero.

    The second \M matches &quot;)&quot;.</pre>
<a name='PROBLEMS'></a><h1>PROBLEMS</h1>
<p>1. How should a \M without a prior \m be interpreted in a regular expression?
Probably should be a compile-time error:</p>
<pre>    \M without preceding \m in pattern at ...</pre>
<p>2. How should an expression like /\m?.*?\M/ be interpreted? Specifically, what
should the meaning of the \M be if the optional \m does not match?</p>
<a name='CHANGE HISTORY'></a><h1>CHANGE HISTORY</h1>
<pre>    v1  08/23/2000  Initial submission

    v2  08/24/2000  1. Change clumsy @^g, @^G configuration variables 
                       to &quot;use matchpairs&quot; directive.
                    2. Change \g\G metacharacters to \m\M.
                    3. Fix a couple typos.
                    4. Add the greedy vs non-greedy example.</pre>
<a name='REFERENCES'></a><h1>REFERENCES</h1>
<p>perlre perldoc page for general discussion of existing regexps</p>
<p>Mastering Regular Expressions book</p>
<p>Blue Camel, chapter 2.</p>
</div>
